<!DOCTYPE html>
<html>
<!-- Created by GNU Texinfo 7.1.1, https://www.gnu.org/software/texinfo/ -->
<head>
<meta http-equiv="Content-Type" content="text/html; charset=utf-8">
<!-- Copyright © 1988-2023 Free Software Foundation, Inc.

Permission is granted to copy, distribute and/or modify this document
under the terms of the GNU Free Documentation License, Version 1.3 or
any later version published by the Free Software Foundation; with the
Invariant Sections being "Funding Free Software", the Front-Cover
Texts being (a) (see below), and with the Back-Cover Texts being (b)
(see below).  A copy of the license is included in the section entitled
"GNU Free Documentation License".

(a) The FSF's Front-Cover Text is:

A GNU Manual

(b) The FSF's Back-Cover Text is:

You have freedom to copy and modify this GNU Manual, like GNU
     software.  Copies published by the Free Software Foundation raise
     funds for GNU development. -->
<title>Analyzer Internals (GNU Compiler Collection (GCC) Internals)</title>

<meta name="description" content="Analyzer Internals (GNU Compiler Collection (GCC) Internals)">
<meta name="keywords" content="Analyzer Internals (GNU Compiler Collection (GCC) Internals)">
<meta name="resource-type" content="document">
<meta name="distribution" content="global">
<meta name="Generator" content="makeinfo">
<meta name="viewport" content="width=device-width,initial-scale=1">

<link href="index.html" rel="start" title="Top">
<link href="Option-Index.html" rel="index" title="Option Index">
<link href="index.html#SEC_Contents" rel="contents" title="Table of Contents">
<link href="Static-Analyzer.html" rel="up" title="Static Analyzer">
<link href="Debugging-the-Analyzer.html" rel="next" title="Debugging the Analyzer">
<style type="text/css">
<!--
a.copiable-link {visibility: hidden; text-decoration: none; line-height: 0em}
div.example {margin-left: 3.2em}
span:hover a.copiable-link {visibility: visible}
ul.mark-bullet {list-style-type: disc}
-->
</style>


</head>

<body lang="en">
<div class="section-level-extent" id="Analyzer-Internals">
<div class="nav-panel">
<p>
Next: <a href="Debugging-the-Analyzer.html" accesskey="n" rel="next">Debugging the Analyzer</a>, Up: <a href="Static-Analyzer.html" accesskey="u" rel="up">Static Analyzer</a> &nbsp; [<a href="index.html#SEC_Contents" title="Table of contents" rel="contents">Contents</a>][<a href="Option-Index.html" title="Index" rel="index">Index</a>]</p>
</div>
<hr>
<h3 class="section" id="Analyzer-Internals-1"><span>27.1 Analyzer Internals<a class="copiable-link" href="#Analyzer-Internals-1"> &para;</a></span></h3>
<a class="index-entry-id" id="index-analyzer_002c-internals"></a>
<a class="index-entry-id" id="index-static-analyzer_002c-internals"></a>

<ul class="mini-toc">
<li><a href="#Overview-2" accesskey="1">Overview</a></li>
<li><a href="#Graphs" accesskey="2">Graphs</a></li>
<li><a href="#State-Tracking" accesskey="3">State Tracking</a></li>
<li><a href="#Region-Model" accesskey="4">Region Model</a></li>
<li><a href="#Analyzer-Paths" accesskey="5">Analyzer Paths</a></li>
<li><a href="#Limitations" accesskey="6">Limitations</a></li>
</ul>
<div class="subsection-level-extent" id="Overview-2">
<h4 class="subsection"><span>27.1.1 Overview<a class="copiable-link" href="#Overview-2"> &para;</a></span></h4>

<p>The analyzer implementation works on the gimple-SSA representation.
(I chose this in the hopes of making it easy to work with LTO to
do whole-program analysis).
</p>
<p>The implementation is read-only: it doesn&rsquo;t attempt to change anything,
just emit warnings.
</p>
<p>The gimple representation can be seen using <samp class="option">-fdump-ipa-analyzer</samp>.
</p><blockquote class="quotation">
<p><b class="b">Tip:</b> If the analyzer ICEs before this is written out, one workaround is to use
<samp class="option">--param=analyzer-bb-explosion-factor=0</samp> to force the analyzer
to bail out after analyzing the first basic block.
</p></blockquote>

<p>First, we build a <code class="code">supergraph</code> which combines the callgraph and all
of the CFGs into a single directed graph, with both interprocedural and
intraprocedural edges.  The nodes and edges in the supergraph are called
&ldquo;supernodes&rdquo; and &ldquo;superedges&rdquo;, and often referred to in code as
<code class="code">snodes</code> and <code class="code">sedges</code>.  Basic blocks in the CFGs are split at
interprocedural calls, so there can be more than one supernode per
basic block.  Most statements will be in just one supernode, but a call
statement can appear in two supernodes: at the end of one for the call,
and again at the start of another for the return.
</p>
<p>The supergraph can be seen using <samp class="option">-fdump-analyzer-supergraph</samp>.
</p>
<p>We then build an <code class="code">analysis_plan</code> which walks the callgraph to
determine which calls might be suitable for being summarized (rather
than fully explored) and thus in what order to explore the functions.
</p>
<p>Next is the heart of the analyzer: we use a worklist to explore state
within the supergraph, building an &quot;exploded graph&quot;.
Nodes in the exploded graph correspond to &lt;point,&nbsp;<!-- /@w -->state&gt; pairs, as in
     &quot;Precise Interprocedural Dataflow Analysis via Graph Reachability&quot;
     (Thomas Reps, Susan Horwitz and Mooly Sagiv).
</p>
<p>We reuse nodes for &lt;point, state&gt; pairs we&rsquo;ve already seen, and avoid
tracking state too closely, so that (hopefully) we rapidly converge
on a final exploded graph, and terminate the analysis.  We also bail
out if the number of exploded &lt;end-of-basic-block, state&gt; nodes gets
larger than a particular multiple of the total number of basic blocks
(to ensure termination in the face of pathological state-explosion
cases, or bugs).  We also stop exploring a point once we hit a limit
of states for that point.
</p>
<p>We can identify problems directly when processing a &lt;point,&nbsp;<!-- /@w -->state&gt;
instance.  For example, if we&rsquo;re finding the successors of
</p>
<div class="example smallexample">
<pre class="example-preformatted">   &lt;point: before-stmt: &quot;free (ptr);&quot;,
    state: {&quot;ptr&quot;: freed}&gt;
</pre></div>

<p>then we can detect a double-free of &quot;ptr&quot;.  We can then emit a path
to reach the problem by finding the simplest route through the graph.
</p>
<p>Program points in the analysis are much more fine-grained than in the
CFG and supergraph, with points (and thus potentially exploded nodes)
for various events, including before individual statements.
By default the exploded graph merges multiple consecutive statements
in a supernode into one exploded edge to minimize the size of the
exploded graph.  This can be suppressed via
<samp class="option">-fanalyzer-fine-grained</samp>.
The fine-grained approach seems to make things simpler and more debuggable
that other approaches I tried, in that each point is responsible for one
thing.
</p>
<p>Program points in the analysis also have a &quot;call string&quot; identifying the
stack of callsites below them, so that paths in the exploded graph
correspond to interprocedurally valid paths: we always return to the
correct call site, propagating state information accordingly.
We avoid infinite recursion by stopping the analysis if a callsite
appears more than <code class="code">analyzer-max-recursion-depth</code> in a callstring
(defaulting to 2).
</p>
</div>
<div class="subsection-level-extent" id="Graphs">
<h4 class="subsection"><span>27.1.2 Graphs<a class="copiable-link" href="#Graphs"> &para;</a></span></h4>

<p>Nodes and edges in the exploded graph are called &ldquo;exploded nodes&rdquo; and
&ldquo;exploded edges&rdquo; and often referred to in the code as
<code class="code">enodes</code> and <code class="code">eedges</code> (especially when distinguishing them
from the <code class="code">snodes</code> and <code class="code">sedges</code> in the supergraph).
</p>
<p>Each graph numbers its nodes, giving unique identifiers - supernodes
are referred to throughout dumps in the form &lsquo;<samp class="samp">SN': <var class="var">index</var></samp>&rsquo; and
exploded nodes in the form &lsquo;<samp class="samp">EN: <var class="var">index</var></samp>&rsquo; (e.g. &lsquo;<samp class="samp">SN: 2</samp>&rsquo; and
&lsquo;<samp class="samp">EN:29</samp>&rsquo;).
</p>
<p>The supergraph can be seen using <samp class="option">-fdump-analyzer-supergraph-graph</samp>.
</p>
<p>The exploded graph can be seen using <samp class="option">-fdump-analyzer-exploded-graph</samp>
and other dump options.  Exploded nodes are color-coded in the .dot output
based on state-machine states to make it easier to see state changes at
a glance.
</p>
</div>
<div class="subsection-level-extent" id="State-Tracking">
<h4 class="subsection"><span>27.1.3 State Tracking<a class="copiable-link" href="#State-Tracking"> &para;</a></span></h4>

<p>There&rsquo;s a tension between:
</p><ul class="itemize mark-bullet">
<li>precision of analysis in the straight-line case, vs
</li><li>exponential blow-up in the face of control flow.
</li></ul>

<p>For example, in general, given this CFG:
</p>
<div class="example smallexample">
<pre class="example-preformatted">      A
     / \
    B   C
     \ /
      D
     / \
    E   F
     \ /
      G
</pre></div>

<p>we want to avoid differences in state-tracking in B and C from
leading to blow-up.  If we don&rsquo;t prevent state blowup, we end up
with exponential growth of the exploded graph like this:
</p>
<div class="example smallexample">
<pre class="example-preformatted">

           1:A
          /   \
         /     \
        /       \
      2:B       3:C
       |         |
      4:D       5:D        (2 exploded nodes for D)
     /   \     /   \
   6:E   7:F 8:E   9:F
    |     |   |     |
   10:G 11:G 12:G  13:G    (4 exploded nodes for G)

</pre></div>

<p>Similar issues arise with loops.
</p>
<p>To prevent this, we follow various approaches:
</p>
<ol class="enumerate" type="a" start="1">
<li> state pruning: which tries to discard state that won&rsquo;t be relevant
later on withing the function.
This can be disabled via <samp class="option">-fno-analyzer-state-purge</samp>.

</li><li> state merging.  We can try to find the commonality between two
program_state instances to make a third, simpler program_state.
We have two strategies here:

<ol class="enumerate">
<li> the worklist keeps new nodes for the same program_point together,
     and tries to merge them before processing, and thus before they have
     successors.  Hence, in the above, the two nodes for D (4 and 5) reach
     the front of the worklist together, and we create a node for D with
     the merger of the incoming states.

</li><li> try merging with the state of existing enodes for the program_point
     (which may have already been explored).  There will be duplication,
     but only one set of duplication; subsequent duplicates are more likely
     to hit the cache.  In particular, (hopefully) all merger chains are
     finite, and so we guarantee termination.
     This is intended to help with loops: we ought to explore the first
     iteration, and then have a &quot;subsequent iterations&quot; exploration,
     which uses a state merged from that of the first, to be more abstract.
  </li></ol>

<p>We avoid merging pairs of states that have state-machine differences,
as these are the kinds of differences that are likely to be most
interesting.  So, for example, given:
</p>
<div class="example smallexample">
<pre class="example-preformatted">      if (condition)
        ptr = malloc (size);
      else
        ptr = local_buf;

      .... do things with 'ptr'

      if (condition)
        free (ptr);

      ...etc
</pre></div>

<p>then we end up with an exploded graph that looks like this:
</p>
<div class="example smallexample">
<pre class="example-preformatted">

                   if (condition)
                     / T      \ F
            ---------          ----------
           /                             \
      ptr = malloc (size)             ptr = local_buf
          |                               |
      copy of                         copy of
        &quot;do things with 'ptr'&quot;          &quot;do things with 'ptr'&quot;
      with ptr: heap-allocated        with ptr: stack-allocated
          |                               |
      if (condition)                  if (condition)
          | known to be T                 | known to be F
      free (ptr);                         |
           \                             /
            -----------------------------
                         | ('ptr' is pruned, so states can be merged)
                        etc

</pre></div>

<p>where some duplication has occurred, but only for the places where the
the different paths are worth exploringly separately.
</p>
<p>Merging can be disabled via <samp class="option">-fno-analyzer-state-merge</samp>.
</p></li></ol>

</div>
<div class="subsection-level-extent" id="Region-Model">
<h4 class="subsection"><span>27.1.4 Region Model<a class="copiable-link" href="#Region-Model"> &para;</a></span></h4>

<p>Part of the state stored at a <code class="code">exploded_node</code> is a <code class="code">region_model</code>.
This is an implementation of the region-based ternary model described in
<a class="url" href="https://www.researchgate.net/publication/221430855_A_Memory_Model_for_Static_Analysis_of_C_Programs">&quot;A Memory Model for Static Analysis of C Programs&quot;</a>
(Zhongxing Xu, Ted Kremenek, and Jian Zhang).
</p>
<p>A <code class="code">region_model</code> encapsulates a representation of the state of
memory, with a <code class="code">store</code> recording a binding between <code class="code">region</code>
instances, to <code class="code">svalue</code> instances.  The bindings are organized into
clusters, where regions accessible via well-defined pointer arithmetic
are in the same cluster.  The representation is graph-like because values
can be pointers to regions.  It also stores a constraint_manager,
capturing relationships between the values.
</p>
<p>Because each node in the <code class="code">exploded_graph</code> has a <code class="code">region_model</code>,
and each of the latter is graph-like, the <code class="code">exploded_graph</code> is in some
ways a graph of graphs.
</p>
<p>Here&rsquo;s an example of printing a <code class="code">program_state</code>, showing the
<code class="code">region_model</code> within it, along with state for the <code class="code">malloc</code>
state machine.
</p>
<div class="example smallexample">
<pre class="example-preformatted">(gdb) call debug (*this)
rmodel:
stack depth: 1
  frame (index 0): frame: ‘test’@1
clusters within frame: ‘test’@1
  cluster for: ptr_3: &amp;HEAP_ALLOCATED_REGION(12)
m_called_unknown_fn: FALSE
constraint_manager:
  equiv classes:
  constraints:
malloc:
  0x2e89590: &amp;HEAP_ALLOCATED_REGION(12): unchecked ('ptr_3')
</pre></div>

<p>This is the state at the point of returning from <code class="code">calls_malloc</code> back
to <code class="code">test</code> in the following:
</p>
<div class="example smallexample">
<pre class="example-preformatted">void *
calls_malloc (void)
{
  void *result = malloc (1024);
  return result;
}

void test (void)
{
  void *ptr = calls_malloc ();
  /* etc.  */
}
</pre></div>

<p>Within the store, there is the cluster for <code class="code">ptr_3</code> within the frame
for <code class="code">test</code>, where the whole cluster is bound to a pointer value,
pointing at <code class="code">HEAP_ALLOCATED_REGION(12)</code>.  Additionally, this pointer
has the <code class="code">unchecked</code> state for the <code class="code">malloc</code> state machine
indicating it hasn&rsquo;t yet been checked against NULL since the allocation
call.
</p>
</div>
<div class="subsection-level-extent" id="Analyzer-Paths">
<h4 class="subsection"><span>27.1.5 Analyzer Paths<a class="copiable-link" href="#Analyzer-Paths"> &para;</a></span></h4>

<p>We need to explain to the user what the problem is, and to persuade them
that there really is a problem.  Hence having a <code class="code">diagnostic_path</code>
isn&rsquo;t just an incidental detail of the analyzer; it&rsquo;s required.
</p>
<p>Paths ought to be:
</p><ul class="itemize mark-bullet">
<li>interprocedurally-valid
</li><li>feasible
</li></ul>

<p>Without state-merging, all paths in the exploded graph are feasible
(in terms of constraints being satisfied).
With state-merging, paths in the exploded graph can be infeasible.
</p>
<p>We collate warnings and only emit them for the simplest path
e.g. for a bug in a utility function, with lots of routes to calling it,
we only emit the simplest path (which could be intraprocedural, if
it can be reproduced without a caller).
</p>
<p>We thus want to find the shortest feasible path through the exploded
graph from the origin to the exploded node at which the diagnostic was
saved.  Unfortunately, if we simply find the shortest such path and
check if it&rsquo;s feasible we might falsely reject the diagnostic, as there
might be a longer path that is feasible.  Examples include the cases
where the diagnostic requires us to go at least once around a loop for a
later condition to be satisfied, or where for a later condition to be
satisfied we need to enter a suite of code that the simpler path skips.
</p>
<p>We attempt to find the shortest feasible path to each diagnostic by
first constructing a &ldquo;trimmed graph&rdquo; from the exploded graph,
containing only those nodes and edges from which there are paths to
the target node, and using Dijkstra&rsquo;s algorithm to order the trimmed
nodes by minimal distance to the target.
</p>
<p>We then use a worklist to iteratively build a &ldquo;feasible graph&rdquo;
(actually a tree), capturing the pertinent state along each path, in
which every path to a &ldquo;feasible node&rdquo; is feasible by construction,
restricting ourselves to the trimmed graph to ensure we stay on target,
and ordering the worklist so that the first feasible path we find to the
target node is the shortest possible path.  Hence we start by trying the
shortest possible path, but if that fails, we explore progressively
longer paths, eventually trying iterations through loops.  The
exploration is captured in the feasible_graph, which can be dumped as a
.dot file via <samp class="option">-fdump-analyzer-feasibility</samp> to visualize the
exploration.  The indices of the feasible nodes show the order in which
they were created.  We effectively explore the tree of feasible paths in
order of shortest path until we either find a feasible path to the
target node, or hit a limit and give up.
</p>
<p>This is something of a brute-force approach, but the trimmed graph
hopefully keeps the complexity manageable.
</p>
<p>This algorithm can be disabled (for debugging purposes) via
<samp class="option">-fno-analyzer-feasibility</samp>, which simply uses the shortest path,
and notes if it is infeasible.
</p>
<p>The above gives us a shortest feasible <code class="code">exploded_path</code> through the
<code class="code">exploded_graph</code> (a list of <code class="code">exploded_edge *</code>).  We use this
<code class="code">exploded_path</code> to build a <code class="code">diagnostic_path</code> (a list of
<strong class="strong">events</strong> for the diagnostic subsystem) - specifically a
<code class="code">checker_path</code>.
</p>
<p>Having built the <code class="code">checker_path</code>, we prune it to try to eliminate
events that aren&rsquo;t relevant, to minimize how much the user has to read.
</p>
<p>After pruning, we notify each event in the path of its ID and record the
IDs of interesting events, allowing for events to refer to other events
in their descriptions.  The <code class="code">pending_diagnostic</code> class has various
vfuncs to support emitting more precise descriptions, so that e.g.
</p>
<ul class="itemize mark-bullet">
<li>a deref-of-unchecked-malloc diagnostic might use:
<div class="example smallexample">
<pre class="example-preformatted">  returning possibly-NULL pointer to 'make_obj' from 'allocator'
</pre></div>
<p>for a <code class="code">return_event</code> to make it clearer how the unchecked value moves
from callee back to caller
</p></li><li>a double-free diagnostic might use:
<div class="example smallexample">
<pre class="example-preformatted">  second 'free' here; first 'free' was at (3)
</pre></div>
<p>and a use-after-free might use
</p><div class="example smallexample">
<pre class="example-preformatted">  use after 'free' here; memory was freed at (2)
</pre></div>
</li></ul>

<p>At this point we can emit the diagnostic.
</p>
</div>
<div class="subsection-level-extent" id="Limitations">
<h4 class="subsection"><span>27.1.6 Limitations<a class="copiable-link" href="#Limitations"> &para;</a></span></h4>

<ul class="itemize mark-bullet">
<li>Only for C so far
</li><li>The implementation of call summaries is currently very simplistic.
</li><li>Lack of function pointer analysis
</li><li>The constraint-handling code assumes reflexivity in some places
(that values are equal to themselves), which is not the case for NaN.
As a simple workaround, constraints on floating-point values are
currently ignored.
</li><li>There are various other limitations in the region model (grep for TODO/xfail
in the testsuite).
</li><li>The constraint_manager&rsquo;s implementation of transitivity is currently too
expensive to enable by default and so must be manually enabled via
<samp class="option">-fanalyzer-transitivity</samp>).
</li><li>The checkers are currently hardcoded and don&rsquo;t allow for user extensibility
(e.g. adding allocate/release pairs).
</li><li>Although the analyzer&rsquo;s test suite has a proof-of-concept test case for
LTO, LTO support hasn&rsquo;t had extensive testing.  There are various
lang-specific things in the analyzer that assume C rather than LTO.
For example, SSA names are printed to the user in &ldquo;raw&rdquo; form, rather
than printing the underlying variable name.
</li></ul>

</div>
</div>
<hr>
<div class="nav-panel">
<p>
Next: <a href="Debugging-the-Analyzer.html">Debugging the Analyzer</a>, Up: <a href="Static-Analyzer.html">Static Analyzer</a> &nbsp; [<a href="index.html#SEC_Contents" title="Table of contents" rel="contents">Contents</a>][<a href="Option-Index.html" title="Index" rel="index">Index</a>]</p>
</div>



</body>
</html>
